The Subscriber Segmenteer

by Andrew Yu, Norman Chen, and Cathy Chen.


Check it out on GitHub.

Executive Summary


Background & Objective

In the space of content creation, subscribers rule above all else. With the release of new publishing platforms like Anchor (podcasts) and Substack (newsletters), a new wave of content creators is starting to emerge. As content niches mature, content creators will need to more heavily compete for advertisement deals and sponsorships, and in order to do so, it is imperative that they understand who their subscribers are. However, these new publishing platforms aren’t equipped with the right tools to help creators properly break down their audience by increasingly niche segmentation criteria. This makes it hard for creators to curate good content for their audience, and it’s doubly difficult to secure advertisement deals if they can only guess at their audience base.

We hope to identify a scalable model that can quickly and accurately categorize subscribers of entrepreneurship-focused content into useful categories (e.g. investors vs startup operators vs students). Ideally, this tool should be:

  1. cost-free considering that many creators are not yet monetized
  2. relatively accurate and granular, and
  3. repeatable and easy to run OR continuously and automatically updated.

Data

To build this model, we used a redacted subscriber snapshot of Climate Tech VC, a rapidly-growing Substack newsletter focused on climate innovation. The starting dataset consists of 8,662 website URLs and can be found here.

In order to transform this data into meaningful information, we pulled the text from the homepage of each website. The results of the initial scrape and cleaning can be found here.

After processing this scraped data, we end with a dataset of all subscribers with English websites and the plaintext from their website's homepage, which can be found here.

Findings

For our applications, K-means clustering actually offers a very powerful tool for subscriber segmentation. A simple Term Frequency-Inverse Document Frequency (TF-IDF) transformation provides ample information to meaningfully cluster our documents.

The LDA topic modeling was able to provide similarly salient segments, but because LDA is a more complex model than K-Means, we find that the simpler solution is likely the better solution.


Part 0: Notebook Setup


Part 1: Initial Data, Processing, & EDA


1.1 Starting with Websites

As previously mentioned, content creators often only have the list of emails for their audience (e.g. through their mailing list), plus a few additional fields that they may or may not have included in their audience onboarding process.

For this exploration, we'll use a snapshot of the subscriber base of a Substack newsletter focused on climate tech. We've already exported this data, so we'll need to load it in via Google Drive.

Taking a look at our data, it's clear that there are two big issues:

  1. Domain names alone aren't very much data to analyze, and segmentation would likely be very difficult.
  2. There are several "repeats" of websites in the data. To see the extent of these repeats, let's take a look at the frequencies of these websites.

We have a lot of repeats in this subscriber list! We don't want to have to parse each of these multiple times, so let's get unique ones, only.
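Deduplication is a one-liner with pandas. A minimal sketch on a hypothetical mini version of the subscriber list (the domain names here are illustrative, not from the real data):

```python
import pandas as pd

# Hypothetical mini subscriber list with repeated domains
websites_df = pd.DataFrame(
    {"url": ["cornell.edu", "gmail.com", "cornell.edu", "gmail.com", "a16z.com"]}
)

# Keep one row per unique domain so each site is only scraped once
unique_websites_df = websites_df.drop_duplicates(subset="url").reset_index(drop=True)
print(unique_websites_df["url"].tolist())
```

We keep the original (duplicated) list around, too, since we'll re-join against it later to weight clusters by subscriber count.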

1.2 Getting More Data

In order to make meaningful customer clusters, we need more data. The only data we have access to is this list of websites - from here we can take only a handful of approaches.

  1. If we know there's a comprehensive database of websites and classifications, we could directly query that database.
  2. If we can't find or access such a database, we can instead go directly to the website and extract more information from each website.

Because we don't have access to Option 1, we instead chose Option 2: scraping text from all the websites.

Note: Because scraping takes such a long time, especially with larger datasets, we've omitted the scraping from this notebook and instead have exported the results of our initial scrape and preprocessing here. We've instead opted to outline our process for the scrape in text.

1.2.1 Initial Scrape

In order to create meaningful data out of website URLs, we scraped plaintext from each website using the requests package, and we cleaned the data using the BeautifulSoup4 package. To speed up processes, we multithreaded our scrape.

import requests
from bs4 import BeautifulSoup
from htmlmetadata import extract_metadata
import concurrent.futures

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
MAX_THREADS = 1000

The first function helps us extract the strings from an HTML document.

# Clean up Data with BeautifulSoup4
# @param html_code is a byte-type html document, generated typically by a requests.content call.
# @returns a list of "stripped strings"; BeautifulSoup4 separates out HTML code into different tags, and the text is extracted from each tag.
def get_strips(html_code):
  soup = BeautifulSoup(html_code, 'lxml')

  for script in soup(['script','style']):
      script.decompose()

  strips = list(soup.stripped_strings)
  return strips
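As a quick sanity check, here's the kind of output get_strips produces on a toy document (this sketch uses Python's built-in parser so it has no lxml dependency; the HTML string is made up):

```python
from bs4 import BeautifulSoup

html = ("<html><head><style>body { color: red; }</style></head>"
        "<body><p>Climate Tech VC</p><script>var x = 1;</script>"
        "<p>Weekly newsletter</p></body></html>")

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):  # drop script/style blocks, as in get_strips
    tag.decompose()
strips = list(soup.stripped_strings)
print(strips)  # ['Climate Tech VC', 'Weekly newsletter']
```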

The second function calls the actual request to get data from the website.

# Scrape data from a given site. 
# @param site is the URL of the website to be scraped, ind is the index of the site in relation to the dataframe that this function is called on
# @return a whole host of data scraped from the website, including error logging, metadata description, metadata language category, and the "strips." 
def scrape_site(site, ind):
  print(site)

  # Generate potential URLs to loop through
  url_starts = ['https://', 'http://', 'https://www.', 'http://www.']
  urls = [start + site for start in url_starts]

  # Define Return Variables
  meta_description = ''
  meta_language = ''
  html_code = ''
  strips = []
  errors = []

  for url in urls:
    try:
      page = requests.get(url, timeout = 5, headers = headers)
      meta = extract_metadata(url)
      meta_description = meta['summary'].get('description')
      meta_language = meta['summary'].get('language')
      html_code = page.content
      strips = get_strips(html_code)
      break
    except Exception as exc:
      errors.append(exc)
  return meta_description, meta_language, str(html_code), strips, errors, ind

And this third function allows us to concurrently run the first two functions on every website in a dataframe.

# Concurrently call scrape_site on all sites across a dataframe
#@param a dataframe that has been properly set up with error, description, language, html_code, and strips columns.
#@return a dataframe populated by all site scrapes.
def scrape_all_sites(df):
  with concurrent.futures.ThreadPoolExecutor(max_workers = MAX_THREADS) as executor:
    future_load_site = {executor.submit(scrape_site, df['url'][ind], ind): ind for ind in df.index}
    for future in concurrent.futures.as_completed(future_load_site):
        ind = future_load_site[future]  # the futures dict maps each future back to its row index
        try:
          meta_description, meta_language, html_code, strips, errors, ind = future.result()
          df['description'][ind] = meta_description
          df['language'][ind] = meta_language
          df['html_code'][ind] = html_code
          df['strips'][ind] = strips
        except Exception as exc:
          df['error'][ind].append(exc)
  return df

1.2.2 Initial Preprocessing

Once the data was consolidated into a dataframe, processing_df, we did basic regular expression extraction to start the preprocessing.

import re

# Scrape the Sites
processing_df = scrape_all_sites(websites_df)

# Consolidate Strips into Text and Remove Non-Word Characters
processing_df['text'] = processing_df['strips'].apply(lambda x: re.sub(r"[^\w\s]", '', ' '.join(x)))

# Lowercase Everything
processing_df['text'] = processing_df['text'].map(lambda x: x.lower())

And then we exported the dataframe using the Pandas .to_csv() function. You can find the exported data here.

1.3 Processing the Data

After scraping the data, we need to process it in several different ways:

  1. Additional Cleanup (removing punctuation, leftover HTML code, etc.)
  2. Language Detection (removing websites that aren't in English, since the packages we'll be using are focused on the English language)
  3. Lemmatization (oftentimes, we'll have similar words that share roots or are variations on the same word. We want to combine those all together)
  4. Remove Stopwords (removing words that are too common and might skew our analyses)

We'll start with scraped_text_df, a dataframe that resulted from concurrently querying every website for their metadata description and language tag (using the htmlmetadata package) and raw text from each website (using the BeautifulSoup4 package).

1.3.1 Additional Cleanup

Because we just loaded this dataset in from GitHub, the default datatype for every column is "object."

We need to ensure that all of our columns hold strings. In pandas, strings are stored under the generic "object" dtype because of their variable length, so the dtype label won't change at face value; what matters is that the underlying values are strings, which the string methods we call later will properly infer.

Next, let's go ahead and combine meta and text, as long as one of the two exist.

There's still a little bit of additional HTML, let's remove that with the html2text package.

We'll also get rid of additional punctuation using the string.translate function.
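The translate-based punctuation strip looks roughly like this (the sample sentence is made up):

```python
import string

text = "hello, world! (climate-tech, anyone?)"

# Build a translation table that deletes every ASCII punctuation character
cleaned = text.translate(str.maketrans('', '', string.punctuation))
print(cleaned)  # hello world climatetech anyone
```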

1.3.2 Language Classification

Furthermore, you'll have noticed that some of the websites aren't in English. While it would be fascinating to explore websites in other languages, it makes clustering a little difficult, so we'll drop all those websites that aren't in English.

To do so, we can rely on both the metadata descriptions for languages, and in the absence of those descriptions, we can instead use the spacy LanguageDetector to determine whether or not something is in English.

Now, we can drop websites that aren't classified as English.

It's important to note that we just dropped nearly a thousand websites. This newsletter sure has a lot of foreign followers.

This particular process can take quite a while, too, so we've exported it here.

1.3.3 Lemmatization

Next, we want to reduce words to their core meaning, lumping together words that essentially mean the same thing (i.e. companies vs company). There are two approaches we could take here: (1) stemming (removing suffixes in order to group words by root, resulting in word stems like compan for companies, company and even companion) and (2) lemmatization (reducing words to a standard lemma, bringing companies and company both to company and leaving companion alone).

Because lemmas are real words that a content creator can actually read, lemmatization makes for a more user-friendly experience, so we prefer it.

We'll use the NLTK WordNetLemmatizer in order to lemmatize our data.

1.3.4 Removing Stopwords

Now that we've got cleaned, lemmatized, English-only documents, it's time to wrap up our processing by ripping out all the stopwords that will clutter up our data.

We'll use gensim simple_preprocess, which lowercases, tokenizes, and de-accents words and the NLTK stopwords list to remove those stopwords from the tokenized list.

1.4 Verifying Our Preprocessing

Now that we've done so much preprocessing, let's take a look at our data.

At this point, we can get rid of all the extra fields and come back to just our new, processed text. We also exported this as a CSV using the Pandas to_csv() function, which you can access here.

However, there is one final step. We don't want websites that have a single instance overpowering websites that have several subscribers, so we re-join our dataset with our initial website list in order to account for frequency. Again, we exported this as a CSV using the Pandas to_csv() function, which you can access here.
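The re-weighting is a plain join back onto the raw subscriber list. A sketch with hypothetical frames (column names and domains assumed for illustration):

```python
import pandas as pd

# One row per unique site with its processed text...
processed = pd.DataFrame({"url": ["a.com", "b.com"],
                          "text": ["solar storage", "venture fund"]})

# ...versus one row per subscriber, with repeats
subscribers = pd.DataFrame({"url": ["a.com", "a.com", "a.com", "b.com"]})

# Joining restores one row per subscriber, so popular domains
# carry proportional weight in downstream counts
weighted = subscribers.merge(processed, on="url", how="inner")
print(len(weighted))  # 4
```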

Interestingly, the word frequencies change drastically after we re-weight / allow for duplicate websites. Microsoft Outlook makes a heavy appearance here.

Part 2: Clustering


With our new cleaned documents, we begin segmenting our readers with various methods. We first try K-Means clustering, and then we try Topic Modeling with Latent Dirichlet Allocation.

2.1 K-Means Clustering

The first thing that comes to mind when we consider clustering is K-Means clustering. Unfortunately, we can't directly compute Euclidean distance between words, so we need to conduct some feature engineering in order to conduct K-Means clustering.

We opted to use TF-IDF (or Term Frequency-Inverse Document Frequency) to compute the importance of a word in a corpus, with the most relevant terms being the most important. The more times a word appears within a document, the stronger that word is (hence Term Frequency), but the more documents the word appears in, the less unique / salient that word is (hence Inverse Document Frequency).

2.1.1 Selecting the Right Data

Because of the way Inverse Document Frequency works, it's best for us to use TF-IDF on our english_cleaned_unique_df dataframe, rather than dataframes that have high repeats. Otherwise, all words associated with Cornell University might be unfairly underweighted. Sorry, Cornell.

2.1.2 Computing TF-IDF

To compute TF-IDF, we use the sklearn TfidfVectorizer.
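An illustrative miniature of the vectorizer (the three toy documents are made up, but they mimic the kinds of websites in our corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "climate startup raising a seed round",
    "venture capital fund backing climate startups",
    "university students researching climate policy",
]

vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(docs)  # rows = documents, columns = terms
print(tf_idf.shape)
```

Because "climate" appears in every document, it gets the lowest IDF of any term, which is exactly the downweighting of ubiquitous words we want.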

Note that in the end, we want to categorize each website with a cluster, so we will run k-means on tf_idf instead of its transpose. In the table above, each entry represents the tf-idf value of the word (the column) in the particular website (the row).

2.1.3 Conducting K-Means

Now that we have the tf-idf values, we can calculate distance. This means we can categorize each website by the combination of the tf-idf values of every word.

The first step of k-means is to determine our k. To do this, we use the "elbow method" to see the change in the sum of squared distances as we increase the number of clusters.
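The elbow method can be sketched on synthetic data (the blobs and the range of k here are illustrative, not our real corpus):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Sum of squared distances (inertia) for each candidate k
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 6)]

# Inertia drops sharply up to the true cluster count (2 here), then flattens;
# the "elbow" is where the marginal improvement levels off
print(inertias)
```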

We can see that the elbow occurs around k=13. We will use this value for our number of clusters and train our final k-means model, km.

2.1.4 Visualizing K-Means

Now that we have our model, we'll want to do the following:

  1. Visualize our K-Means clusters;
  2. Create word clouds for our clusters; and
  3. Try to label our clusters ourselves.

First, let's define some functions to extract the top words of each cluster (get_top_features_cluster) and visualize those words with a barplot (plotWords).
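A minimal sketch of what get_top_features_cluster might look like (the argument names, shapes, and toy data are assumptions for illustration):

```python
import numpy as np

def get_top_features_cluster(tf_idf_array, prediction, n_feats, feature_names):
    """For each cluster label, average the tf-idf rows assigned to it and
    return the n_feats highest-scoring terms with their mean scores."""
    top = {}
    for label in np.unique(prediction):
        mean_scores = tf_idf_array[prediction == label].mean(axis=0)
        best = np.argsort(mean_scores)[::-1][:n_feats]
        top[label] = [(feature_names[i], mean_scores[i]) for i in best]
    return top

# Tiny example: three documents over a 3-term vocabulary, two clusters
X = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9]])
labels = np.array([0, 0, 1])
top = get_top_features_cluster(X, labels, 2, ["climate", "tech", "student"])
print(top[0][0][0], top[1][0][0])  # climate student
```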

We can see some very clear trends for words within given clusters. To further visualize this, we use word clouds.

Here the patterns emerge. Based on the word clouds, these are our best guesses.

Cluster Number   Our Best Guess
0                Sign-In Text for Personal Emails
1                Businesses / Startups
2                Geographic Locations and News
3                Climate Change (Carbon / Emissions Focus)
4                Microsoft & Outlook Users
5                High Finance & Later Stage Finance
6                Personal Domains / Email Forwarding / Error Messages
7                Venture Capital
8                Random Bits of Javascript and HTML
9                Students / Academia
10               Carbon Capture and Emissions
11               Energy / Renewables
12               More Bits of Javascript and HTML

2.1.5 Putting K-Means to the Test

Now that we have our K-Means model, we use these trained assignments to assign each of our documents (each website) to one of these clusters.

Next, we can map these numeric cluster labels to our "best guesses".
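This mapping is a dictionary lookup over the cluster column. A sketch with a hypothetical frame (the URLs are made up; the names come from our best-guess table):

```python
import pandas as pd

# A few of our best-guess labels, keyed by cluster number
cluster_names = {1: "Businesses / Startups",
                 7: "Venture Capital",
                 9: "Students / Academia"}

clustered_df = pd.DataFrame({"url": ["a.com", "b.com", "c.edu"],
                             "cluster": [1, 7, 9]})
clustered_df["segment"] = clustered_df["cluster"].map(cluster_names)
print(clustered_df["segment"].tolist())
```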

With that completed, we can take a look at the distribution of unique sites by our newly named clusters.

Lastly, if we want to look at our overall subscriber base, we ought to consider all of our subscribers, which includes the repeats.

It's interesting to note how much the personal email clusters inflate when we include repeats; this is exactly what we would expect.

We can see that the following clusters dominate:

  1. Businesses / Startups (General)
  2. Students / Academia
  3. Venture Capital
  4. Personal Emails & Microsoft and Outlook Users & Sites That Provide Random Javascript

This makes sense considering that the newsletter this dataset is from focuses on Climate Tech, which is the intersection of sustainability and entrepreneurship. However, the overwhelming dominance of Businesses / Startups is interesting. Let's explore this a little deeper.

This actually makes a lot of sense - generalist businesses, consulting firms, and accelerators all use broader business language on their websites and aren't nearly as targeted as venture capital. Furthermore, we see that this cluster is highly dispersed (i.e. there's a very long tail for this particular category), which is exactly what we'd expect. Nice!

2.1.6 Thoughts on K-Means

In general, we can see that our clusters are fairly accurate. In each cluster, we can see that the words are indeed correlated with each other, suggesting that our metric, tf-idf, is a solid basis for calculating distance for k-means clusters. Using the final distribution of clusters, we can see that Climate Tech VC's subscribers generally align with its goals, focused on climate tech startups, students, and venture capitalists.

Overall, K-Means has a few distinct benefits as well as a few drawbacks. While K-Means clustering is a strong unsupervised learning method, traditional implementations are definitive / exclusive. That is to say, a given word or document is assigned to one cluster and one cluster only. Oftentimes, we'll find words that appear in multiple documents (especially words that might have different definitions depending on context). While K-Means is a strong solution for our particular problem, we also explore a probabilistic approach using Latent Dirichlet Allocation.

2.2 Topic Modeling with Latent Dirichlet Allocation


In order to account for the fact that not all words fall nicely into one bucket or the other, and neither do companies, we can try a probabilistic approach to clustering, instead.

We find a potential solution in Latent Dirichlet Allocation (LDA). Without going too far into depth (read more here), the LDA model effectively assumes the following:

  1. The length of each document follows some Poisson distribution.
  2. Each document is a mixture over a fixed number K of "topics," and that mixture follows some Dirichlet distribution over the K topics.
  3. Each word has its own probability of appearing in each topic (so words can show up in multiple topics).
  4. Thus, a document is effectively generated by drawing a random length, then repeatedly sampling a topic from the document's topic mixture and a word from that topic.

2.2.1 Data Revisited

For LDA, words can fit in multiple topics, which means that we may want to remove words that appear too often. Instead of computing TF-IDF to do this, we instead extract bigrams like "venture capital" and remove terms that might appear everywhere (e.g. company, email, etc.).

We'll create bigrams in order to capture phrases like "venture capital" and "climate tech."

Now, our docs should also have bigrams.

At this point, we have several words that should be common across all documents, as well as words that only appear once or twice and never again. We remove those using the gensim.corpora Dictionary package.

And then lastly, let's check the overall features of our data again.

2.2.2 Finding the Ideal Topic Number

Now that we've organized our data, the next thing we need to do is make sure we choose the right number of topics.

First, we assign a few important variables that will be used for all our models (i.e. corpus and id2word)

Given what we've learned from K-Means clustering and its optimal 13 clusters, we can expect that there should be more than 5 topics but likely no more than 20. So we can use our function to search for the topic count with the highest coherence.

It seems that maximum coherence is generally achieved at 12 topics, so we will use 12 topics for our final model.

2.2.3 A 12-Topic LDA Model

With the data prepared, we can try running an LDA model to see if we can come up with salient topics. We've already set up important variables, so the only thing left is to run the model for 12 topics.

With the model run, we look at the topics put out by the model.

We can also calculate the coherence of the model.

And we can visualize the topics with pyLDAvis. While we can't render it here, you can access it on GitHub here.

Many of the topics are actually very similar to the topics generated by K-Means. For example, Topic 2 is related to academia, and Topic 10 is mainly miscellaneous Javascript and login credentials.

Note: to see this visualization, you'll need to run the ipynb version of this notebook yourself. We haven't quite figured out how to get this visualization to render in the HTML export.

2.2.4 Segmentation Using LDA

Now that we have a model trained, we can use it to generate topic probability distributions for each website. For example, we can generate the probability of topics for Cornell University, shown below.

Overall, the topics separated out by our LDA model are interesting but not as intuitive as those from our K-Means clustering. As our objective is to make a tool that can be easily used by content creators, K-Means clustering is a better bet.